Executive Summary

Many countries around the world provide nationwide dietary guidelines to the public and require nutrition labels on food products to help people make informed choices about foods and drinks they consume. Every individual, however, has different nutrition needs and preferences according to their age, sex, ethnicity, height, weight, and physical activity level, among many other factors. Therefore, people may benefit from more personalized dietary recommendations. Our goal is to provide the basis for an algorithm that recommends foods a person should consume based on their nutrition needs. To this end, we compared different approaches to grouping foods based on their nutrient profiles (i.e., spectrum clustering vs. k-means clustering). As expected, spectrum clustering provides better clustering, so we explore the clusters created through spectrum clustering further. Finally, we demonstrate how others can use our data-driven food groupings to meet their dietary needs.

Introduction

Importance of nutrition in affecting everyday life

One of the most important determinants of a person’s health is their diet: the nutrients we consume can affect a range of health outcomes, from cardiovascular disease (Shanta Retelny, Neuendorf, and Roth 2008) and obesity (Popkin and Gordon-Larsen 2004) to learning and general brain function (Dani, Burrill, and Demmig-Adams 2005). Unfortunately, many people around the world do not consume enough of the nutrients that promote health, and instead consume too much of the nutrients that directly harm it (Popkin and Gordon-Larsen 2004). In many cases, across the United States and similarly wealthy countries, people are simply overwhelmed by the overabundance of food options available to them, so they end up choosing whatever is easiest or fastest to consume, which contributes to the popularity of “fast food”. Given the importance of the nutrients in our diet, grouping similar foods into categories narrows the range of options people must consider and will likely free up time to make better diet decisions.

Problem with current food groupings

Although foods are already grouped into categories in many nutrition datasets, the means by which these food groupings are created is unclear, so food items in a given group may not necessarily have the same nutrient profile. A new interpretation of food groupings can change the way diets are currently made, helping people make more “intelligent” diet decisions. More specifically, the results of this study can be used to find the list of foods that have the best combination of nutrients based on a person’s dietary needs.

Study goals

In the current project, we aim to provide the basis for an algorithm that recommends foods a person should consume based on their nutrition needs. In more concrete terms, we wanted to identify clusters of foods that have the highest intra-group similarity and lowest inter-group similarity based on their nutrient profile.

Methods

Exploratory data analysis

Our dataset originates from the most recent version of the Canadian Nutrient File (updated in 2015), provided on the Canadian government site. The database contains average values for nutrients in foods available in Canada, with much of the data coming from the USDA National Nutrient Database for Standard Reference. These averages are based on the generic versions of a food, unless there is a brand specifically included in the database. This is a bilingual dataset with food names, descriptions, and background information that are in both French and English. This version of the database was created to update nutrient values for foods that are the largest contributors of sodium to the diet, since one of the major goals of manufacturers is to reduce sodium content of foods. To this end, the database assesses more than 5690 unique foods, ranging from foods such as Cheese souffle to Vanilla extract, and provides the average nutrient levels per 100 grams.

The original data came as a set of separate datasets that had columns with identifiers (e.g., FoodID) to link them, so we then merged the datasets based on the unique identifiers and selected the main variables of interest: the food labels and the nutrient profile associated with a given food label. With this subsetted version of the data, we ran exploratory data analyses.

First, we noticed that some of the nutrients included are subcomponents of other nutrients. For instance, there is a total sugars column that is a sum of several other columns in the data. In other cases, there are columns that are essentially different ways of measuring the same nutrient. So for example, there are two metrics for food energy (i.e., kilocalories and kilojoules), where one kilocalorie is equal to 4.184 kilojoules. We decided to leave these variables as they were, since we expected the methods used in our analyses to be able to essentially ignore these redundancies.

It is also worth noting that many values are missing. For instance, only 0.982% of the rows have values for biotin (see Missing values in clean dataset section). Because our target analyses require a dataset with no missing data, we needed to fill in or remove the missing values. We explored two options for handling them to see whether the choice would affect the results: (1) mean imputation for all variables (i.e., replacing each missing value in a column with the mean of that column), or (2) first removing variables with more than 50% missing values and then using mean imputation on the remaining variables. The two options produced similar results, so all subsequent analyses use the version of the dataset with mean imputation applied to all variables. See the appendix for a full summary of the variables in the cleaned dataset after using mean imputation for all variables.
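The two missing-data options described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the project; the function names, the toy data, and the use of None to mark a missing value are assumptions made for the example.

```python
from statistics import mean

def drop_sparse_columns(rows, max_missing=0.5):
    """Drop columns whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / n <= max_missing
    ]
    return [{col: r[col] for col in keep} for r in rows]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    col_means = {}
    for col in rows[0]:
        observed = [r[col] for r in rows if r[col] is not None]
        col_means[col] = mean(observed) if observed else None
    return [
        {col: (r[col] if r[col] is not None else col_means[col]) for col in r}
        for r in rows
    ]

# Hypothetical toy data: three foods, two nutrients (None = missing)
foods = [
    {"protein": 10.0, "biotin": None},
    {"protein": None, "biotin": None},
    {"protein": 20.0, "biotin": 3.0},
]

# Option 1: mean imputation for every variable
imputed_all = mean_impute(foods)

# Option 2: drop variables with more than 50% missing, then impute the rest
imputed_after_drop = mean_impute(drop_sparse_columns(foods))
```

In the toy data, biotin is missing in two of three rows, so option 2 drops it, while option 1 fills both gaps with the single observed value. This also shows why heavily missing columns carry little information after mean imputation.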

Main analyses

Since our goal is to identify the best algorithm for grouping foods to help a person plan meals that suit their needs, we compared different approaches to grouping foods based on their nutrient profiles. That is, we wanted foods assigned to the same group to have nutrient profiles that are as similar as possible (maximizing in-group similarity), while foods assigned to different groups have very different nutrient profiles (minimizing out-group similarity). The slides, code, and database associated with this project are publicly available here.

The first option we explored was k-means clustering. K-means clustering is a commonly used unsupervised machine learning algorithm that divides a dataset into a pre-specified number of groups (known as clusters in this context). The algorithm assigns foods to clusters so that the distance between the nutrient profile of each food and the mean nutrient profile of its assigned cluster is as low as possible. In other words, it tries to achieve the highest possible intra-group similarity for the number of clusters that we specify. To identify the optimal number of clusters, we used the silhouette method, which compares the quality of the clusters produced across a range of cluster numbers. To make this comparison, the silhouette method calculates the average silhouette width for a given number of clusters, k. The silhouette width of an observation measures how close it is to the other members of its own cluster relative to the members of the nearest neighboring cluster; it ranges from -1 to 1, with higher values indicating better-separated clusters.
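To make the procedure concrete, here is a small pure-Python sketch of k-means (Lloyd's algorithm) together with the average silhouette width used to choose k. It is an illustration under stated assumptions rather than the project's implementation: the function names, the random initialization, the convention of scoring singleton clusters as 0, and the toy points are all hypothetical.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def avg_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean distance
    to the point's own cluster and b the mean distance to the nearest other cluster."""
    if len(set(labels)) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups of "foods" in two nutrient dimensions
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Silhouette method: try several values of k and keep the one with the widest average silhouette
widths = {k: avg_silhouette(points, kmeans(points, k)[0]) for k in (2, 3, 4)}
best_k = max(widths, key=widths.get)
labels, centers = kmeans(points, best_k)
```

On the toy data the widest average silhouette occurs at k = 2, matching the two visible groups. On real data with many nutrients the curve is usually much flatter, which is why the width is compared across a range of k.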

The second approach we explored was spectrum clustering. Spectrum clustering adds another step right before the k-means clustering step. Instead of using the raw data to create clusters, spectrum clustering reduces the number of variables that we cluster on by running a principal components analysis (PCA) first. PCA may be especially advantageous when there are a large number of variables to cluster on, because it linearly combines the original variables into new ones (PC scores) such that each successive component captures as much of the remaining variance as possible. By keeping the components that capture the most variance, we can reduce the dimensions of the data without losing much information from the original data. Once we had the PC scores, we used the silhouette method as before to identify the optimal number of clusters before running k-means clustering.
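The PCA step can be sketched in plain Python as below. This is a hedged illustration only: it assumes the variables are centered and scaled first, and it finds the leading components by power iteration with deflation, which is a stand-in for whatever PCA routine the project actually used. All function names and the toy values are hypothetical.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_components(cov, n_components, n_iter=500):
    """Leading eigenvectors and eigenvalues of a symmetric matrix
    via power iteration with deflation."""
    p = len(cov)
    cov = [row[:] for row in cov]
    components, variances = [], []
    for _ in range(n_components):
        v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
        components.append(v)
        variances.append(lam)
        # Deflate: remove the variance explained by this component
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return components, variances

def project(rows, components):
    """PC scores: each row's coordinates along the retained components."""
    return [[sum(x * w for x, w in zip(r, comp)) for comp in components] for r in rows]

# Hypothetical toy data: (kilocalories, kilojoules, protein) for five foods.
# Kilojoules are an exact multiple of kilocalories, so one column is redundant.
kcal = [50, 120, 250, 400, 560]
protein = [2, 12, 6, 25, 9]
rows = [(k, 4.184 * k, p) for k, p in zip(kcal, protein)]

z = standardize(rows)
components, variances = top_components(covariance(z), n_components=2)
scores = project(z, components)  # these scores would then be passed to k-means
```

Because the kilojoule column duplicates the kilocalorie column, two components capture essentially all of the variance in the three standardized variables. This is the sense in which PCA can absorb redundant nutrient columns.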

These were the two main clustering approaches that we compared. Since spectrum clustering combines the power of noise reduction through PCA with clustering analyses, we expected beforehand that it would be a better approach for grouping our data.

Results

Comparing k-means clustering to spectrum clustering

First approach: K means clustering on all available nutrients

The silhouette method recommended we use 8 clusters (see appendix), so we ran k-means clustering using kmeans() with centers set to 8. Once the clusters were created, we explored some of their characteristics. First, there is a wide range in the number of foods assigned to each cluster, with the largest cluster consisting of 3687 foods and the smallest consisting of 9, and a median cluster size of 41. Ideally, the between-cluster sum of squares should be a large proportion of the total sum of squares (the total squared distance of all observations from the global center) (https://stats.stackexchange.com/questions/82776/what-does-total-ss-and-between-ss-mean-in-k-means-clustering), which would suggest that there is high separation between groups. In this case, we find that the between-cluster sum of squares (3.681 × 10^{10}) explains more than half (67.812%) of the total sum of squares (5.429 × 10^{10}). The plot below shows what the clusters looked like using k-means clustering across two randomly chosen variables, carbs and calories, along with the mean values of each cluster for those variables.
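The sum-of-squares comparison above rests on the identity total SS = within SS + between SS. The sketch below computes the three quantities for a toy example; it is an illustration with hypothetical names and data, analogous to (but not taken from) the totss, tot.withinss, and betweenss values that R's kmeans() reports.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sum_sq(points, center):
    """Sum of squared Euclidean distances from each point to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def ss_decomposition(points, labels):
    """Return (total SS, within-cluster SS, between-cluster SS)."""
    total_ss = sum_sq(points, centroid(points))
    within_ss = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        within_ss += sum_sq(members, centroid(members))
    return total_ss, within_ss, total_ss - within_ss

# Hypothetical toy data: four foods measured on one nutrient, in two clusters
points = [(0,), (2,), (10,), (12,)]
labels = [0, 0, 1, 1]

total_ss, within_ss, between_ss = ss_decomposition(points, labels)
share_between = between_ss / total_ss  # proportion of variation explained by the clustering
```

A share close to 1 means most of the variation lies between clusters rather than within them. Because the sums are in squared raw units, nutrients measured on large scales dominate the result unless the data are scaled first.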

Second approach: spectrum clustering on all available nutrients

Before we

Scaling data is more reasonable since all variables are measured in different units. (We might wanna just remove comparing PVE parts above and just explain that it conceptually makes more sense to center & scale data here)

Describe: - results from PCA (scree plot), elbow rule - characteristics of cluster - size of cluster, within & between group SS - plot of the “8 food clusters based on 7 PCs”

The plot below shows what the clusters looked like using spectrum clustering across the first two PCs, along with labels for the food in each cluster with the highest absolute PC1 score. The separation between clusters here is more obvious, which is supported by the lower average of the total within-cluster sum of squares across clusters for the spectrum clustering, at 22017

Regular k-means clustering provides better clusters

To compare the quality of the clusters produced by the k-means and spectrum clustering methods, we used different metrics of internal cluster validation to see if one method produced consistently superior clustering results. Internal cluster validation is one of three categories of cluster validation statistics available to evaluate the quality of clustering results. We chose internal cluster validation metrics because they best align with our goal of assessing cluster quality without an external point of comparison; using one would be considered external cluster validation (e.g., comparing whether the food groups we created align with the food groups listed in the dataset). We are more interested in creating new high-quality clusters using nutrient profiles than in trying to predict which of the previously established food groups a food would fall into based on its nutrient profile.

The first internal validation metric we used is identical to the one we used to identify the optimal number of clusters before running our main clustering analyses: average silhouette width.

0.232

0.362

The second internal validation metric we used is known as the Dunn index.
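For readers unfamiliar with it, the Dunn index is the smallest distance between two observations in different clusters divided by the largest distance between two observations in the same cluster; higher values indicate compact, well-separated clusters. The sketch below uses that standard definition on hypothetical toy data. The function name is illustrative, and the version computed in the project may differ in detail.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Minimum between-cluster distance divided by maximum within-cluster distance."""
    min_between = math.inf
    max_within = 0.0
    for (p, lab_p), (q, lab_q) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        if lab_p == lab_q:
            max_within = max(max_within, d)   # candidate for the largest cluster diameter
        else:
            min_between = min(min_between, d)  # candidate for the smallest separation
    return min_between / max_within

# Hypothetical toy data: two tight clusters five units apart
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]

dunn = dunn_index(points, labels)
```

Because it depends on a single minimum and a single maximum, the Dunn index is sensitive to outliers, so it is best read alongside the average silhouette width rather than on its own.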

Thus, our results support our hypothesis that spectrum clustering would provide much better groups of foods, maximizing the similarity of nutrient profiles of foods within any given cluster.

Exploring “intelligent” clusters

We plotted word clouds of the most commonly used words within each cluster. For instance, here you can see that the most commonly used word in cluster 2 by far was “raw”, along with “meat”, “lean”, and “beef”. This cluster tends to have more meats, so it will likely have a higher protein content compared to other clusters.

See appendix for the full set of word clouds for this clustering result

We also explored which groups tend to have the highest calorie content - and by far the highest calorie groups were clusters 3 and 4. So if someone was trying to reduce their overall calorie intake, it may be a good idea to avoid these food groups.

  • most important nutrients based on PCA (see appendix)

Spectrum clustering on theoretically-driven nutrients

Describe: - plot of the “clustering over two randomly chosen variables”

  • word clouds

  • mean kcal per cluster

  • most important nutrients based on PCA (see appendix)

Case study: Following a diet to gain muscle

After determining the best clustering approach based on food nutrient profiles, we provide a case study as an example of how this tool can be used to make recommendations for a person who has specific dietary needs. Arnold is an aspiring professional bodybuilder who is interested in gaining muscle as quickly as possible. Since he knows the importance of nutrition in the muscle-building process, he plans to change his diet to achieve this goal and has decided to follow the guidance he found at the links here and here. Since Arnold currently weighs 180 pounds and a person who wants to gain muscle is recommended to consume 1.5 grams of protein per pound of bodyweight, his target protein intake is 270 grams per day. It is also recommended that people consume 2 to 3 grams of carbohydrates per pound of bodyweight to gain muscle. Therefore, Arnold aims to consume between 360 and 540 grams of total carbohydrates per day. Fat is another important macronutrient that will affect Arnold’s muscle-building process. Since the articles he found suggest between 20 and 30 percent of his calories should come from fat, he aims to consume X grams of fat per day. Finally, it is recommended that he consume at least 3600 kilocalories per day to gain muscle. There are also several other nutrients that he would like to consume at more than 2 standard deviations above the mean because of their muscle-building properties, including calcium, biotin, iron, vitamin C, selenium, omega-3, vitamin D, vitamin B12, copper, magnesium, riboflavin, and zinc.
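The arithmetic behind Arnold's macronutrient targets can be written out as a short sketch. The protein and carbohydrate figures follow directly from the per-pound guidance quoted above. The fat range is not stated in the text, so the version below rests on two labeled assumptions: that the percentages apply to the 3600 kilocalorie target, and that fat supplies 9 kilocalories per gram. The function and parameter names are hypothetical.

```python
KCAL_PER_GRAM_FAT = 9  # assumption: conventional energy factor for fat, not given in the text

def macro_targets(weight_lb, kcal_target,
                  protein_g_per_lb=1.5, carb_g_per_lb=(2, 3), fat_pct=(20, 30)):
    """Daily macronutrient targets in grams from bodyweight and a calorie target."""
    return {
        "protein_g": protein_g_per_lb * weight_lb,
        "carbs_g": tuple(g * weight_lb for g in carb_g_per_lb),
        # Assumption: the fat percentages are applied to kcal_target
        "fat_g": tuple(kcal_target * pct / 100 / KCAL_PER_GRAM_FAT for pct in fat_pct),
    }

targets = macro_targets(weight_lb=180, kcal_target=3600)
```

Under those assumptions the sketch reproduces the 270 grams of protein and 360 to 540 grams of carbohydrates given above, and yields a fat range of 80 to 120 grams per day. That fat range is an output of the assumptions, not a figure reported by the authors.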

To simulate Arnold, we created a copy of the dataset from our main analyses and compared Arnold’s daily nutrient targets to the daily nutrient recommendations for someone in the general population who matches him on all characteristics except for their level of activity. To determine a point of comparison for Arnold’s nutrient intake, we used the DRI Calculator for Healthcare Professionals from the National Agricultural Library. This tool calculates daily nutrient recommendations based on the Dietary Reference Intakes (DRIs) established by the Health and Medicine Division of the National Academies of Sciences, Engineering and Medicine, representing the most current scientific knowledge on nutrient needs. We entered gender (Male), age (20), height (5 feet 10 inches), weight (180 pounds), and activity level (sedentary) for someone who matches Arnold on all characteristics but activity level, since Arnold will be above the mean on activity level compared to the typical person. Therefore, the nutrient recommendations provided by this tool represent someone who matches Arnold on all characteristics, but does not aspire to be a pro bodybuilder like he does (and hence has a lower activity level). The metric of comparison we used is how much larger (or smaller) Arnold’s nutrient goals are compared to the recommendations from the DRI Calculator. For instance, Arnold is aiming to consume 3600 kilocalories per day, while the general recommendation for someone like him who is less active is 2734 kilocalories per day. Therefore, Arnold is consuming approximately 1.32 (3600/2734) times the amount of kilocalories recommended. Based on these calculations, Arnold will be consuming 1.2 times the amount of total carbohydrates recommended, 4.15 times the amount of protein recommended, 270 times the amount of saturated fats recommended, and 10.8 times the amount of total fat recommended.
We created nutrient data for Arnold by multiplying the average value of the target nutrient across the entire dataset by the ratio of Arnold’s target to the recommended amount (e.g., 1.32 * average value of nutrient across all foods) for kilocalories, total carbohydrates, saturated fats, protein, and total fat. For the nutrients listed above that Arnold wants to consume at 2 or more standard deviations above the mean (i.e., calcium, biotin, iron, vitamin C, selenium, omega-3, vitamin D, vitamin B12, copper, magnesium, riboflavin, and zinc), we calculated the value 2 standard deviations above the dataset mean and inserted it into Arnold’s nutrient data.
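The construction of Arnold's nutrient profile can be sketched as follows. This is an illustrative reading of the description above, not the project's code: the function name, the dictionary layout, the toy values, and the choice of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def build_target_profile(foods, ratios, high_nutrients, n_sd=2.0):
    """Build a target nutrient profile from dataset-wide statistics.

    ratios: nutrient -> (target intake / recommended intake), applied to the dataset mean.
    high_nutrients: nutrients set to n_sd standard deviations above the dataset mean.
    """
    target = {}
    for nutrient, ratio in ratios.items():
        target[nutrient] = ratio * mean(f[nutrient] for f in foods)
    for nutrient in high_nutrients:
        values = [f[nutrient] for f in foods]
        # Assumption: sample standard deviation (n - 1 denominator)
        target[nutrient] = mean(values) + n_sd * stdev(values)
    return target

# Hypothetical toy dataset: three foods, values per 100 grams
foods = [
    {"kcal": 100.0, "zinc": 1.0},
    {"kcal": 200.0, "zinc": 2.0},
    {"kcal": 300.0, "zinc": 3.0},
]

arnold = build_target_profile(
    foods,
    ratios={"kcal": 3600 / 2734},  # ratio taken from the kilocalorie example in the text
    high_nutrients=["zinc"],
)
```

Expressing the target on the same per-food scale as the dataset is what allows it to be compared directly with cluster means in the next step.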

To provide our recommendation, we identified the cluster that most closely matched Arnold’s nutrient needs. That is, we identified the cluster whose mean nutrient profile had the smallest Euclidean distance from Arnold’s goal nutrient profile. Based on our analyses, Arnold would best achieve his goals by eating foods in cluster 5. So we recommend Arnold target foods like turkey, Brussels sprouts, and pork, all of which are found in cluster 5.
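The matching step amounts to a nearest-centroid lookup, sketched below. The cluster identifiers, nutrient ordering, and numbers are hypothetical and chosen only to make the example runnable; they are not the project's cluster means.

```python
import math

def nearest_cluster(target, centers):
    """Return the id of the cluster whose center is closest to the target profile."""
    return min(centers, key=lambda cid: math.dist(target, centers[cid]))

# Hypothetical cluster means, ordered as (protein g, carbohydrate g, fat g) per 100 grams
centers = {
    1: (2.0, 60.0, 1.0),
    2: (25.0, 0.0, 10.0),
    3: (5.0, 10.0, 80.0),
}
target = (22.0, 5.0, 8.0)  # hypothetical goal profile on the same scale

best = nearest_cluster(target, centers)
distances = {cid: math.dist(target, c) for cid, c in centers.items()}
```

Keeping the full dictionary of distances, rather than only the winner, makes it easy to see how close the runner-up clusters are. If the nutrients are on very different scales, standardizing both the centers and the target before measuring distance may be worth considering.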

Conclusion

Our results contradict our a priori hypothesis that spectrum clustering would provide much better groups of foods compared to k-means clustering. Using various metrics of internal clustering validation, we find that k-means clustering serves as a better method with this dataset for creating clusters. We explored the final set of clusters through various plots of the most frequently used words in each cluster, along with average calorie count per cluster and most important nutrients in each cluster.

Our case study with Arnold provides an example of how these results can be used in a real-world context. Although we chose to focus on one specific example of the practical uses of this work, there are a number of other relevant contexts. For instance, these results can help patients with chronic illnesses identify groups of foods that help improve their medical condition. In another case, our results can be used to cluster new foods recently approved by the FDA based on their nutrient content, which may help grocery stores determine food placement. On a more individual level, if a person likes a specific food for its nutrient content, they can use our clustering results to identify similar foods that they may enjoy. Finally, these results can be used when a person is following a recipe and wants to use a food from a specific cluster, but does not have that ingredient readily available. Using our clustering results, they can identify the foods that are most similar to the missing ingredient and replace it easily.

Limitations

One of the more notable limitations of this work is that the Canadian Nutrient File only provides average amounts of the nutrients for a combination of all available versions of a given food (e.g., average sugar content of all brands of ketchup). Therefore, the clustering results here may not be useful in cases where the nutrient profile of a specific brand of food deviates far from the mean. Also, this dataset is only relevant to products available in Canada, so the results cannot be generalized to products from other countries. Future research should explore whether these findings replicate among products in other countries. Another feature of the dataset that may be considered a limitation, depending on a person’s needs, is that the nutrient values are all standardized (per 100 grams) and thus may not be representative of how much a person actually consumes in a package. A person would need to convert the nutrient values based on the actual portion sizes they eat if they wanted to accurately assess how many of their daily nutrient goals they are hitting. Finally, in the case study, we had only a proxy for nutrient recommendations based on target nutrient goals. More specifically, we used daily recommended nutrients as the proxy to scale Arnold’s nutrient amounts when we created the simulated data, but all of the nutrients in our dataset are at the level of an individual food. Therefore, using this tool with food-specific goals may be more appropriate, since our calculations were at a more aggregate level (i.e., “how much above the average are Arnold’s nutrient goals?” rather than “how much above the average is this specific food?”). Overall, there are a number of fruitful avenues for future work to extend and improve upon our current analyses.

Appendix

Full summary of clean dataset

Data Frame Summary

data_mean_imp

Dimensions: 5690 x 153
Duplicates: 0

No | Variable | Type | Mean (sd) | min < med < max | IQR (CV) | Distinct values | Missing
1 | PROTEIN | numeric | 11.1 (10.8) | 0 < 7.6 < 85.6 | 16.6 (1) | 2261 | 0 (0.0%)
2 | FAT (TOTAL LIPIDS) | numeric | 10 (16.7) | 0 < 3.8 < 100 | 11.6 (1.7) | 1913 | 0 (0.0%)
3 | CARBOHYDRATE, TOTAL (BY DIFFERENCE) | numeric | 22 (26.5) | 0 < 10.3 < 100 | 31.6 (1.2) | 2756 | 0 (0.0%)
4 | ASH, TOTAL | numeric | 1.9 (3.5) | 0 < 1.2 < 99.8 | 1.2 (1.8) | 647 | 0 (0.0%)
5 | ENERGY (KILOCALORIES) | numeric | 219 (174) | 0 < 174 < 902 | 240 (0.8) | 665 | 0 (0.0%)
6 | ALCOHOL | numeric | 0.1 (1.8) | 0 < 0 < 42.5 | 0 (14.3) | 38 | 0 (0.0%)
7 | MOISTURE | numeric | 55 (31) | 0 < 64.7 < 100 | 50.3 (0.6) | 3417 | 0 (0.0%)
8 | CAFFEINE | numeric | 3.9 (97.8) | 0 < 0 < 5714 | 0 (24.9) | 62 | 0 (0.0%)
9 | THEOBROMINE | numeric | 7 (74.8) | 0 < 0 < 2634 | 0 (10.7) | 131 | 0 (0.0%)
10 | ENERGY (KILOJOULES) | numeric | 915 (727) | 0 < 727 < 3774 | 1006 (0.8) | 1659 | 0 (0.0%)
11 | SUGARS, TOTAL | numeric | 7.7 (13.6) | 0 < 3 < 99.8 | 7.7 (1.8) | 1506 | 0 (0.0%)
12 | FIBRE, TOTAL DIETARY | numeric | 2.4 (4.7) | 0 < 1 < 79 | 2.7 (1.9) | 238 | 0 (0.0%)
13 | CALCIUM | numeric | 76.9 (219) | 0 < 25 < 7364 | 63 (2.8) | 469 | 0 (0.0%)
14 | IRON | numeric | 2.6 (5.6) | 0 < 1.1 < 124 | 2.1 (2.2) | 883 | 0 (0.0%)
15 | MAGNESIUM | numeric | 39.7 (63.6) | 0 < 22 < 781 | 26.7 (1.6) | 303 | 0 (0.0%)
16 | PHOSPHORUS | numeric | 168 (233) | 0 < 136 < 9918 | 172 (1.4) | 621 | 0 (0.0%)
17 | POTASSIUM | numeric | 308 (441) | 0 < 240 < 16500 | 208 (1.4) | 896 | 0 (0.0%)
18 | SODIUM | numeric | 333 (1214) | 0 < 83 < 38758 | 335 (3.6) | 1100 | 0 (0.0%)
19 | ZINC | numeric | 1.6 (3) | 0 < 0.9 < 91 | 1.7 (1.8) | 696 | 0 (0.0%)
20 | COPPER | numeric | 0.2 (0.6) | 0 < 0.1 < 15.1 | 0.2 (2.8) | 788 | 0 (0.0%)
21 | MANGANESE | numeric | 0.6 (3.5) | 0 < 0.2 < 133 | 0.6 (5.8) | 1227 | 0 (0.0%)
22 | SELENIUM | numeric | 14.6 (34.3) | 0 < 10 < 1917 | 16.7 (2.4) | 615 | 0 (0.0%)
23 | RETINOL | numeric | 88.8 (802) | 0 < 0 < 30000 | 27 (9) | 327 | 0 (0.0%)
24 | BETA CAROTENE | numeric | 292 (1610) | 0 < 2 < 42891 | 120 (5.5) | 613 | 0 (0.0%)
25 | ALPHA-TOCOPHEROL | numeric | 1.2 (3.5) | 0 < 0.6 < 149 | 1 (3) | 448 | 0 (0.0%)
26 | VITAMIN D (INTERNATIONAL UNITS) | numeric | 23.9 (226) | 0 < 0 < 12716 | 18 (9.5) | 215 | 0 (0.0%)
27 | VITAMIN D (D2 + D3) | numeric | 0.6 (5.9) | 0 < 0 < 318 | 0.4 (9.4) | 130 | 0 (0.0%)
28 | VITAMIN C | numeric | 8.2 (52.3) | 0 < 0.2 < 1900 | 4.6 (6.4) | 459 | 0 (0.0%)
29 | THIAMIN | numeric | 0.2 (0.5) | 0 < 0.1 < 23.4 | 0.2 (2.5) | 813 | 0 (0.0%)
30 | RIBOFLAVIN | numeric | 0.2 (0.4) | 0 < 0.1 < 17.5 | 0.2 (2) | 710 | 0 (0.0%)
31 | NIACIN (NICOTINIC ACID) PREFORMED | numeric | 3.1 (4.3) | 0 < 1.8 < 128 | 4.3 (1.4) | 2829 | 0 (0.0%)
32 | TOTAL NIACIN EQUIVALENT | numeric | 5.2 (5.5) | 0 < 3.9 < 132 | 6.9 (1.1) | 3909 | 0 (0.0%)
33 | PANTOTHENIC ACID | numeric | 0.6 (0.9) | 0 < 0.6 < 21.9 | 0.5 (1.3) | 1317 | 0 (0.0%)
34 | VITAMIN B-6 | numeric | 0.2 (1) | 0 < 0.1 < 68.8 | 0.2 (4.3) | 757 | 0 (0.0%)
35 | TOTAL FOLACIN | numeric | 37.7 (89.9) | 0 < 14 < 3786 | 32.7 (2.4) | 291 | 0 (0.0%)
36 | VITAMIN B-12 | numeric | 1.1 (6.5) | 0 < 0.1 < 380 | 1 (5.9) | 900 | 0 (0.0%)
37 | VITAMIN K | numeric | 20.8 (74.6) | 0 < 20.8 < 1714 | 19.4 (3.6) | 435 | 0 (0.0%)
38 | FOLIC ACID | numeric | 8.4 (48.4) | 0 < 0 < 2993 | 0 (5.7) | 161 | 0 (0.0%)
39 | TRYPTOPHAN | numeric | 0.1 (0.1) | 0 < 0.1 < 1.6 | 0.1 (0.8) | 459 | 0 (0.0%)
40 | THREONINE | numeric | 0.5 (0.4) | 0 < 0.5 < 3.7 | 0.5 (0.8) | 1301 | 0 (0.0%)
41 | ISOLEUCINE | numeric | 0.6 (0.4) | 0 < 0.6 < 5 | 0.5 (0.8) | 1370 | 0 (0.0%)
42 | LEUCINE | numeric | 1 (0.7) | 0 < 1 < 7.2 | 0.9 (0.8) | 1835 | 0 (0.0%)
43 | LYSINE | numeric | 0.9 (0.8) | 0 < 0.9 < 5.8 | 0.9 (0.9) | 1699 | 0 (0.0%)
44 | METHIONINE | numeric | 0.3 (0.2) | 0 < 0.3 < 3.2 | 0.3 (0.8) | 860 | 0 (0.0%)
45 | CYSTINE | numeric | 0.2 (0.1) | 0 < 0.2 < 2.1 | 0.1 (0.8) | 496 | 0 (0.0%)
46 | PHENYLALANINE | numeric | 0.5 (0.4) | 0 < 0.5 < 5.2 | 0.4 (0.7) | 1275 | 0 (0.0%)
47 | TYROSINE | numeric | 0.4 (0.3) | 0 < 0.4 < 3.3 | 0.4 (0.8) | 1132 | 0 (0.0%)
48 | VALINE | numeric | 0.6 (0.5) | 0 < 0.6 < 6.2 | 0.5 (0.8) | 1429 | 0 (0.0%)
49 | ARGININE | numeric | 0.8 (0.7) | 0 < 0.8 < 7.4 | 0.8 (0.9) | 1627 | 0 (0.0%)
50 | HISTIDINE | numeric | 0.4 (0.3) | 0 < 0.4 < 2.3 | 0.3 (0.8) | 1076 | 0 (0.0%)
51 | ALANINE | numeric | 0.7 (0.5) | 0 < 0.7 < 8 | 0.6 (0.8) | 1490 | 0 (0.0%)
52 | ASPARTIC ACID | numeric | 1.2 (0.9) | 0 < 1.2 < 10.2 | 1 (0.8) | 1937 | 0 (0.0%)
53 | GLUTAMIC ACID | numeric | 2.4 (10.1) | 0 < 2.4 < 757 | 1.7 (4.3) | 2433 | 0 (0.0%)
54 | GLYCINE | numeric | 0.6 (0.6) | 0 < 0.6 < 19 | 0.6 (0.9) | 1439 | 0 (0.0%)
55 | PROLINE | numeric | 0.7 (0.5) | 0 < 0.7 < 12.3 | 0.5 (0.8) | 1420 | 0 (0.0%)
56 | SERINE | numeric | 0.5 (0.4) | 0 < 0.5 < 6.1 | 0.4 (0.7) | 1289 | 0 (0.0%)
57 | CHOLESTEROL | numeric | 41.5 (136) | 0 < 2 < 3100 | 59 (3.3) | 292 | 0 (0.0%)
58 | FATTY ACIDS, TRANS, TOTAL | numeric | 0.3 (1.1) | 0 < 0.3 < 37.6 | 0.2 (3.6) | 499 | 0 (0.0%)
59 | FATTY ACIDS, SATURATED, TOTAL | numeric | 3.1 (5.7) | 0 < 1.3 < 95.6 | 3.3 (1.8) | 2813 | 0 (0.0%)
60 | FATTY ACIDS, SATURATED, 8:0, OCTANOIC | numeric | 0 (0.2) | 0 < 0 < 7.5 | 0 (6) | 267 | 0 (0.0%)
61 | FATTY ACIDS, SATURATED, 10:0, DECANOIC | numeric | 0 (0.2) | 0 < 0 < 6 | 0 (4.4) | 351 | 0 (0.0%)
62 | FATTY ACIDS, SATURATED, 12:0, DODECANOIC | numeric | 0.2 (1.5) | 0 < 0 < 47 | 0.2 (7.7) | 449 | 0 (0.0%)
63 | FATTY ACIDS, SATURATED, 14:0, TETRADECANOIC | numeric | 0.2 (0.8) | 0 < 0.1 < 22.8 | 0.2 (3.3) | 788 | 0 (0.0%)
64 | FATTY ACIDS, SATURATED, 16:0, HEXADECANOIC | numeric | 1.7 (2.6) | 0 < 0.9 < 43.5 | 1.8 (1.6) | 2323 | 0 (0.0%)
65 | FATTY ACIDS, SATURATED, 18:0, OCTADECANOIC | numeric | 0.8 (1.6) | 0 < 0.4 < 33.2 | 0.8 (1.9) | 1676 | 0 (0.0%)
66 | FATTY ACIDS, MONOUNSATURATED, 18:1undifferentiated, OCTADECENOIC | numeric | 3.5 (6.8) | 0 < 1.4 < 82.6 | 3.4 (1.9) | 2709 | 0 (0.0%)
67 | FATTY ACIDS, POLYUNSATURATED, 18:2undifferentiated, LINOLEIC, OCTADECADIENOIC | numeric | 1.8 (4.4) | 0 < 0.5 < 74.6 | 1.7 (2.4) | 2080 | 0 (0.0%)
68 | FATTY ACIDS, POLYUNSATURATED, 18:3undifferentiated, LINOLENIC, OCTADECATRIENOIC | numeric | 0.2 (1.2) | 0 < 0.1 < 53.4 | 0.2 (5.6) | 690 | 0 (0.0%)
69 | FATTY ACIDS, POLYUNSATURATED, 20:4, EICOSATETRAENOIC | numeric | 0 (0.1) | 0 < 0 < 1.8 | 0 (2.3) | 262 | 0 (0.0%)
70 | FATTY ACIDS, POLYUNSATURATED, 22:6 n-3, DOCOSAHEXAENOIC (DHA) | numeric | 0 (0.5) | 0 < 0 < 18.2 | 0 (9.5) | 297 | 0 (0.0%)
71 | FATTY ACIDS, MONOUNSATURATED, 16:1undifferentiated, HEXADECENOIC | numeric | 0.2 (0.9) | 0 < 0.1 < 18.9 | 0.2 (3.7) | 771 | 0 (0.0%)
72 | FATTY ACIDS, POLYUNSATURATED, 18:4, OCTADECATETRAENOIC | numeric | 0 (0.1) | 0 < 0 < 3 | 0 (8.4) | 127 | 0 (0.0%)
73 | FATTY ACIDS, POLYUNSATURATED, 20:5 n-3, EICOSAPENTAENOIC (EPA) | numeric | 0 (0.4) | 0 < 0 < 13.2 | 0 (8.5) | 260 | 0 (0.0%)
74 | FATTY ACIDS, MONOUNSATURATED, 22:1undifferentiated, DOCOSENOIC | numeric | 0 (0.7) | 0 < 0 < 41.2 | 0 (14.8) | 200 | 0 (0.0%)
75 | FATTY ACIDS, POLYUNSATURATED, 22:5 n-3, DOCOSAPENTAENOIC (DPA) | numeric | 0 (0.2) | 0 < 0 < 5.6 | 0 (10.9) | 163 | 0 (0.0%)
76 | FATTY ACIDS, MONOUNSATURATED, TOTAL | numeric | 3.9 (7.6) | 0 < 1.4 < 83.7 | 4.2 (1.9) | 2881 | 0 (0.0%)
77 | FATTY ACIDS, POLYUNSATURATED, TOTAL | numeric | 2.2 (5.1) | 0 < 0.7 < 74.6 | 2 (2.3) | 2382 | 0 (0.0%)
78 | NATURALLY OCCURRING FOLATE | numeric | 29.2 (71.9) | 0 < 11 < 2340 | 24.2 (2.5) | 262 | 0 (0.0%)
79 | RETINOL ACTIVITY EQUIVALENTS | numeric | 115 (817) | 0 < 5 < 30000 | 43 (7.1) | 464 | 0 (0.0%)
80 | DIETARY FOLATE EQUIVALENTS | numeric | 44.4 (114) | 0 < 15 < 5881 | 39.4 (2.6) | 333 | 0 (0.0%)
81 | FATTY ACIDS, POLYUNSATURATED, 18:2 c,c n-6, LINOLEIC, OCTADECADIENOIC | numeric | 2.3 (3.9) | 0 < 2.3 < 74.6 | 1.5 (1.7) | 1286 | 0 (0.0%)
82 | FATTY ACIDS, POLYUNSATURATED, 20:3, EICOSATRIENOIC | numeric | 0 (0) | 0 < 0 < 1.4 | 0 (5.5) | 90 | 0 (0.0%)
83 | FATTY ACIDS, POLYUNSATURATED, 18:3 c,c,c n-3 LINOLENIC, OCTADECATRIENOIC | numeric | 0.2 (1.1) | 0 < 0.1 < 53.4 | 0.2 (5.6) | 620 | 0 (0.0%)
84 | FATTY ACIDS, POLYUNSATURATED, 18:3 c,c,c n-6, g-LINOLENIC, OCTADECATRIENOIC | numeric | 0 (0) | 0 < 0 < 1 | 0 (17.6) | 50 | 0 (0.0%)
85 | BETA CRYPTOXANTHIN | numeric | 15.2 (152) | 0 < 1 < 6252 | 15.2 (10) | 129 | 0 (0.0%)
86 | LYCOPENE | numeric | 220 (1390) | 0 < 0 < 46260 | 220 (6.3) | 191 | 0 (0.0%)
87 | LUTEIN AND ZEAXANTHIN | numeric | 260 (1063) | 0 < 104 < 19697 | 260 (4.1) | 420 | 0 (0.0%)
88 | FATTY ACIDS, POLYUNSATURATED, 20:3 n-6, EICOSATRIENOIC | numeric | 0 (0) | 0 < 0 < 1.4 | 0 (13.9) | 72 | 0 (0.0%)
89 FATTY ACIDS, POLYUNSATURATED, 20:4 n-6, ARACHIDONIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 1.8
IQR (CV) : 0 (2)
228 distinct values 0
(0.0%)
90 FATTY ACIDS, POLYUNSATURATED, 20:3 n-3 EICOSATRIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 1
IQR (CV) : 0 (15.9)
53 distinct values 0
(0.0%)
91 VITAMIN B12, ADDED
[numeric]
Mean (sd) : 1 (5)
min < med < max:
0 < 1 < 380
IQR (CV) : 0 (5.2)
29 distinct values 0
(0.0%)
92 ALPHA-TOCOPHEROL, ADDED
[numeric]
Mean (sd) : 0.1 (0.3)
min < med < max:
0 < 0.1 < 16.9
IQR (CV) : 0 (3.4)
12 distinct values 0
(0.0%)
93 VITAMIN D2, ERGOCALCIFEROL
[numeric]
Mean (sd) : 0.3 (0.5)
min < med < max:
0 < 0.3 < 28.1
IQR (CV) : 0 (1.5)
23 distinct values 0
(0.0%)
94 FATTY ACIDS, SATURATED, 4:0, BUTANOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 3.2
IQR (CV) : 0 (4.1)
275 distinct values 0
(0.0%)
95 FATTY ACIDS, SATURATED, 6:0, HEXANOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 2
IQR (CV) : 0 (4)
225 distinct values 0
(0.0%)
96 ALPHA CAROTENE
[numeric]
Mean (sd) : 40.8 (297)
min < med < max:
0 < 1 < 14251
IQR (CV) : 40.8 (7.3)
165 distinct values 0
(0.0%)
97 FATTY ACIDS, MONOUNSATURATED, 22:1c, DOCOSENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 1.1
IQR (CV) : 0 (4)
101 distinct values 0
(0.0%)
98 FATTY ACIDS, POLYUNSATURATED, 18:3i, LINOLENIC, OCTADECATRIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.3
IQR (CV) : 0 (2.2)
55 distinct values 0
(0.0%)
99 FATTY ACIDS, MONOUNSATURATED, 22:1t, DOCOSENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.1
IQR (CV) : 0 (12.8)
17 distinct values 0
(0.0%)
100 SUCROSE
[numeric]
Mean (sd) : 2 (5)
min < med < max:
0 < 2 < 99.8
IQR (CV) : 2 (2.5)
488 distinct values 0
(0.0%)
101 GLUCOSE
[numeric]
Mean (sd) : 0.8 (1.7)
min < med < max:
0 < 0.8 < 35.8
IQR (CV) : 0.8 (2.2)
400 distinct values 0
(0.0%)
102 FRUCTOSE
[numeric]
Mean (sd) : 0.7 (1.7)
min < med < max:
0 < 0.7 < 55.6
IQR (CV) : 0.7 (2.4)
388 distinct values 0
(0.0%)
103 LACTOSE
[numeric]
Mean (sd) : 0.3 (0.8)
min < med < max:
0 < 0.3 < 13.2
IQR (CV) : 0.3 (2.7)
226 distinct values 0
(0.0%)
104 MALTOSE
[numeric]
Mean (sd) : 0.2 (0.5)
min < med < max:
0 < 0.2 < 16.4
IQR (CV) : 0.2 (2.6)
218 distinct values 0
(0.0%)
105 GALACTOSE
[numeric]
Mean (sd) : 0 (0.3)
min < med < max:
0 < 0 < 19.9
IQR (CV) : 0 (9.5)
54 distinct values 0
(0.0%)
106 FATTY ACIDS, SATURATED, 20:0, EICOSANOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 4.6
IQR (CV) : 0 (2.7)
184 distinct values 0
(0.0%)
107 FATTY ACIDS, SATURATED, 22:0, DOCOSANOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 3.7
IQR (CV) : 0 (3.4)
134 distinct values 0
(0.0%)
108 FATTY ACIDS, MONOUNSATURATED, 14:1, TETRADECENOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 1.8
IQR (CV) : 0 (2.4)
157 distinct values 0
(0.0%)
109 FATTY ACIDS, MONOUNSATURATED, 20:1, EICOSENOIC
[numeric]
Mean (sd) : 0.1 (0.5)
min < med < max:
0 < 0 < 15
IQR (CV) : 0.1 (5.3)
366 distinct values 0
(0.0%)
110 FATTY ACIDS, SATURATED, 15:0, PENTADECANOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.9
IQR (CV) : 0 (1.7)
122 distinct values 0
(0.0%)
111 FATTY ACIDS, SATURATED, 17:0, HEPTADECANOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.8
IQR (CV) : 0 (1.2)
190 distinct values 0
(0.0%)
112 FATTY ACIDS, SATURATED, 24:0, TETRACOSANOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 4.7
IQR (CV) : 0 (5.2)
92 distinct values 0
(0.0%)
113 STARCH
[numeric]
Mean (sd) : 4 (6.7)
min < med < max:
0 < 4 < 73.3
IQR (CV) : 4 (1.7)
361 distinct values 0
(0.0%)
114 BETA-TOCOPHEROL
[numeric]
Mean (sd) : 0.1 (0.2)
min < med < max:
0 < 0.1 < 10.5
IQR (CV) : 0 (1.9)
66 distinct values 0
(0.0%)
115 GAMMA-TOCOPHEROL
[numeric]
Mean (sd) : 2.3 (2.1)
min < med < max:
0 < 2.3 < 65.2
IQR (CV) : 0 (0.9)
275 distinct values 0
(0.0%)
116 DELTA-TOCOPHEROL
[numeric]
Mean (sd) : 0.4 (0.5)
min < med < max:
0 < 0.4 < 15.4
IQR (CV) : 0 (1.2)
149 distinct values 0
(0.0%)
117 FATTY ACIDS, MONOUNSATURATED, 16:1t, HEXADECENOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 6.1
IQR (CV) : 0 (10.8)
74 distinct values 0
(0.0%)
118 FATTY ACIDS, MONOUNSATURATED, 18:1t, OCTADECENOIC
[numeric]
Mean (sd) : 0.1 (0.4)
min < med < max:
0 < 0.1 < 20.2
IQR (CV) : 0 (2.9)
296 distinct values 0
(0.0%)
119 FATTY ACIDS, POLYUNSATURATED, 18:2i, LINOLEIC, OCTADECADIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 2.3
IQR (CV) : 0 (1.8)
141 distinct values 0
(0.0%)
120 FATTY ACIDS, MONOUNSATURATED, 24:1c, TETRACOSENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.6
IQR (CV) : 0 (4.1)
46 distinct values 0
(0.0%)
121 FATTY ACIDS, MONOUNSATURATED, 16:1c, HEXADECENOIC
[numeric]
Mean (sd) : 0.1 (0.2)
min < med < max:
0 < 0.1 < 6.9
IQR (CV) : 0 (1.3)
397 distinct values 0
(0.0%)
122 FATTY ACIDS, POLYUNSATURATED, 20:2 c,c EICOSADIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.7
IQR (CV) : 0 (1.7)
129 distinct values 0
(0.0%)
123 FATTY ACIDS, MONOUNSATURATED, 18:1c, OCTADECENOIC
[numeric]
Mean (sd) : 4.7 (37.8)
min < med < max:
0 < 4.7 < 2845
IQR (CV) : 0 (8.1)
1067 distinct values 0
(0.0%)
124 FATTY ACIDS, MONOUNSATURATED, 17:1, HEPTADECENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 1.1
IQR (CV) : 0 (1.5)
136 distinct values 0
(0.0%)
125 FATTY ACIDS, TOTAL TRANS-MONOENOIC
[numeric]
Mean (sd) : 0.1 (0.4)
min < med < max:
0 < 0.1 < 20.2
IQR (CV) : 0 (3.1)
286 distinct values 0
(0.0%)
126 FATTY ACIDS, MONOUNSATURATED, 15:1, PENTADECENOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 6
IQR (CV) : 0 (15)
28 distinct values 0
(0.0%)
127 FATTY ACIDS, POLYUNSATURATED, CONJUGATED, 18:2 cla, LINOLEIC, OCTADECADIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 1.1
IQR (CV) : 0 (2.2)
91 distinct values 0
(0.0%)
128 FATTY ACIDS, POLYUNSATURATED, 22:4 n-6, DOCOSATETRAENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.3
IQR (CV) : 0 (1.6)
67 distinct values 0
(0.0%)
129 FATTY ACIDS, TOTAL TRANS-POLYENOIC
[numeric]
Mean (sd) : 0 (0.1)
min < med < max:
0 < 0 < 2.5
IQR (CV) : 0 (1.8)
155 distinct values 0
(0.0%)
130 CHOLINE, TOTAL
[numeric]
Mean (sd) : 39.1 (50.2)
min < med < max:
0 < 39.1 < 2403
IQR (CV) : 20.4 (1.3)
911 distinct values 0
(0.0%)
131 BETAINE
[numeric]
Mean (sd) : 10.6 (13.8)
min < med < max:
0 < 10.6 < 630
IQR (CV) : 0 (1.3)
258 distinct values 0
(0.0%)
132 FATTY ACIDS, POLYUNSATURATED, TOTAL OMEGA N-3
[numeric]
Mean (sd) : 0.5 (1.4)
min < med < max:
0 < 0.5 < 53.4
IQR (CV) : 0.3 (2.9)
549 distinct values 0
(0.0%)
133 FATTY ACIDS, POLYUNSATURATED, TOTAL OMEGA N-6
[numeric]
Mean (sd) : 3.1 (13.8)
min < med < max:
0 < 3.1 < 953
IQR (CV) : 1.7 (4.5)
1056 distinct values 0
(0.0%)
134 ASPARTAME
[numeric]
Mean (sd) : 51.1 (49.6)
min < med < max:
0 < 51.1 < 3722
IQR (CV) : 0 (1)
0.00 : 82 ( 1.4%)
37.00 : 1 ( 0.0%)
42.00 : 1 ( 0.0%)
51.15!: 5603 (98.5%)
52.00 : 1 ( 0.0%)
597.00 : 1 ( 0.0%)
3722.00 : 1 ( 0.0%)
! rounded


0
(0.0%)
135 TOTAL PLANT STEROL
[numeric]
Mean (sd) : 26.4 (28.2)
min < med < max:
0 < 26.4 < 1190
IQR (CV) : 0 (1.1)
118 distinct values 0
(0.0%)
136 MANNITOL
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.2
IQR (CV) : 0 (8.6)
4 distinct values 0
(0.0%)
137 SORBITOL
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 2.3
IQR (CV) : 0 (6.9)
0.00 : 1375 (24.2%)
0.01!: 4304 (75.6%)
0.10 : 1 ( 0.0%)
0.20 : 1 ( 0.0%)
0.30 : 1 ( 0.0%)
0.60 : 2 ( 0.0%)
0.80 : 1 ( 0.0%)
1.00 : 3 ( 0.1%)
2.10 : 1 ( 0.0%)
2.30 : 1 ( 0.0%)
! rounded


0
(0.0%)
138 STIGMASTEROL
[numeric]
Mean (sd) : 1.3 (1.8)
min < med < max:
0 < 1.3 < 59
IQR (CV) : 0 (1.4)
27 distinct values 0
(0.0%)
139 TOTAL MONOSACCARIDES
[numeric]
Mean (sd) : 0.8 (1.5)
min < med < max:
0 < 0.8 < 30.6
IQR (CV) : 0.7 (1.9)
268 distinct values 0
(0.0%)
140 TOTAL DISACCHARIDES
[numeric]
Mean (sd) : 1.5 (2.8)
min < med < max:
0 < 1.5 < 47.2
IQR (CV) : 1.3 (1.9)
296 distinct values 0
(0.0%)
141 BETA-SITOSTEROL
[numeric]
Mean (sd) : 14 (14)
min < med < max:
0 < 14 < 621
IQR (CV) : 0 (1)
55 distinct values 0
(0.0%)
142 HYDROXYPROLINE
[numeric]
Mean (sd) : 0.1 (0)
min < med < max:
0 < 0.1 < 0.7
IQR (CV) : 0 (0.4)
198 distinct values 0
(0.0%)
143 FATTY ACIDS, SATURATED, 13:0 TRIDECANOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.1
IQR (CV) : 0 (3)
11 distinct values 0
(0.0%)
144 FATTY ACIDS, POLYUNSATURATED, 21:5
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.3
IQR (CV) : 0 (4.7)
13 distinct values 0
(0.0%)
145 FATTY ACIDS, MONOUNSATURATED, 24:1undifferentiated, TETRACOSENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 3
IQR (CV) : 0 (8.9)
33 distinct values 0
(0.0%)
146 FATTY ACIDS, MONOUNSATURATED, 12:1, LAUROLEIC
[numeric]
1 distinct value 0 : 5690 (100.0%) 0
(0.0%)
147 FATTY ACIDS, POLYUNSATURATED, 22:3,
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.1
IQR (CV) : 0 (5.5)
11 distinct values 0
(0.0%)
148 FATTY ACIDS, POLYUNSATURATED, 22:2, DOCOSADIENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0
IQR (CV) : 0 (6.5)
5 distinct values 0
(0.0%)
149 FATTY ACIDS, POLYUNSATURATED, 18:2t,t , OCTADECADIENENOIC
[numeric]
Mean (sd) : 0 (0)
min < med < max:
0 < 0 < 0.5
IQR (CV) : 0 (2.5)
60 distinct values 0
(0.0%)
150 CAMPESTEROL
[numeric]
Mean (sd) : 3.8 (3.6)
min < med < max:
0 < 3.8 < 189
IQR (CV) : 0 (0.9)
28 distinct values 0
(0.0%)
151 BIOTIN
[numeric]
Mean (sd) : 6.1 (0.9)
min < med < max:
0 < 6.1 < 31.6
IQR (CV) : 0 (0.1)
72 distinct values 0
(0.0%)
152 NA
[numeric]
1 distinct value 0.10 : 5690 (100.0%) 0
(0.0%)
153 OXALIC ACID
[numeric]
Mean (sd) : 0.3 (0)
min < med < max:
0 < 0.3 < 1.7
IQR (CV) : 0 (0.1)
28 distinct values 0
(0.0%)

Missing values in clean dataset

Optimal number of clusters for k-means clustering on full dataset

Word clouds for k-means clustering on full dataset

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Cluster 8

Scree plot from PCA in spectrum clustering on full dataset

Optimal number of clusters for spectrum clustering on 10 PCs on full dataset

Scree plot from PCA in spectrum clustering on theoretically-driven nutrients

Optimal number of clusters for spectrum clustering on theoretically-driven nutrients

Most important nutrients for spectrum clustering on theoretically-driven nutrients

Cluster 1

Cluster 2

Citations for packages used

  • factoextra (version 1.0.7; Alboukadel Kassambara and Fabian Mundt, 2020)
  • GGally (version 2.1.1; Barret Schloerke et al., 2021)
  • gtsummary (version 1.3.7; Daniel Sjoberg et al., 2021)
  • summarytools (version 0.9.9; Dominic Comtois, 2021)
  • Matrix (version 1.3.2; Douglas Bates and Martin Maechler, 2021)
  • RColorBrewer (version 1.1.2; Erich Neuwirth, 2014)
  • clValid (version 0.7; Guy Brock et al., 2008)
  • ggplot2 (version 3.3.3; H. Wickham, 2016)
  • stringr (version 1.4.0; Hadley Wickham, 2019)
  • forcats (version 0.5.1; Hadley Wickham, 2021)
  • tidyr (version 1.1.3; Hadley Wickham, 2021)
  • readr (version 1.4.0; Hadley Wickham and Jim Hester, 2020)
  • dplyr (version 1.0.5; Hadley Wickham et al., 2021)
  • stargazer (version 5.2.2; Hlavac, Marek, 2018)
  • wordcloud (version 2.6; Ian Fellows, 2018)
  • tm (version 0.7.8; Ingo Feinerer and Kurt Hornik, 2020)
  • glmnet (version 4.1.1; Jerome Friedman et al., 2010)
  • car (version 3.0.10; John Fox and Sanford Weisberg, 2019)
  • carData (version 3.0.4; John Fox, Sanford Weisberg and Brad Price, 2020)
  • here (version 1.0.1; Kirill Müller, 2020)
  • tibble (version 3.1.0; Kirill Müller and Hadley Wickham, 2021)
  • NLP (version 0.2.1; Kurt Hornik, 2020)
  • purrr (version 0.3.4; Lionel Henry and Hadley Wickham, 2020)
  • sjPlot (version 2.8.7; Lüdecke D, 2021)
  • cluster (version 2.1.0; Maechler et al., 2019)
  • report (version 0.3.0.9000; Makowski et al., 2020)
  • data.table (version 1.14.0; Matt Dowle and Arun Srinivasan, 2021)
  • varhandle (version 2.0.5; Mehrad Mahmoudian, 2020)
  • SnowballC (version 0.7.0; Milan Bouchet-Valat, 2020)
  • imputeTS (version 3.2; Moritz S, Bartz-Beielstein T, 2017)
  • R (version 4.0.4; R Core Team, 2021)
  • pacman (version 0.5.1; Rinker et al., 2017)
  • corrplot (version 0.84; Taiyun Wei and Viliam Simko, 2017)
  • tidyverse (version 1.3.0; Wickham et al., 2019)
  • pROC (version 1.17.0.1; Xavier Robin et al., 2011)

  • Alboukadel Kassambara and Fabian Mundt (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7. https://CRAN.R-project.org/package=factoextra
  • Barret Schloerke, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Jason Crowley (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.1. https://CRAN.R-project.org/package=GGally
  • Daniel D. Sjoberg, Michael Curry, Margie Hannum, Karissa Whiting and Emily C. Zabor (2021). gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. R package version 1.3.7. https://CRAN.R-project.org/package=gtsummary
  • Dominic Comtois (2021). summarytools: Tools to Quickly and Neatly Summarize Data. R package version 0.9.9. https://CRAN.R-project.org/package=summarytools
  • Douglas Bates and Martin Maechler (2021). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.3-2. https://CRAN.R-project.org/package=Matrix
  • Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
  • Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22. URL https://www.jstatsoft.org/v25/i04/
  • H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  • Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
  • Hadley Wickham (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1. https://CRAN.R-project.org/package=forcats
  • Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr
  • Hadley Wickham and Jim Hester (2020). readr: Read Rectangular Text Data. R package version 1.4.0. https://CRAN.R-project.org/package=readr
  • Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr
  • Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.1. https://CRAN.R-project.org/package=stargazer
  • Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud
  • Ingo Feinerer and Kurt Hornik (2020). tm: Text Mining Package. R package version 0.7-8. https://CRAN.R-project.org/package=tm
  • Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL https://www.jstatsoft.org/v33/i01/.
  • John Fox and Sanford Weisberg (2019). An R Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
  • John Fox, Sanford Weisberg and Brad Price (2020). carData: Companion to Applied Regression Data Sets. R package version 3.0-4. https://CRAN.R-project.org/package=carData
  • Kirill Müller (2020). here: A Simpler Way to Find Your Files. R package version 1.0.1. https://CRAN.R-project.org/package=here
  • Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.0. https://CRAN.R-project.org/package=tibble
  • Kurt Hornik (2020). NLP: Natural Language Processing Infrastructure. R package version 0.2-1. https://CRAN.R-project.org/package=NLP
  • Lionel Henry and Hadley Wickham (2020). purrr: Functional Programming Tools. R package version 0.3.4. https://CRAN.R-project.org/package=purrr
  • Lüdecke D (2021). sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.7. https://CRAN.R-project.org/package=sjPlot
  • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2019). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.0.
  • Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020). Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption. CRAN. Available from https://github.com/easystats/report.
  • Matt Dowle and Arun Srinivasan (2021). data.table: Extension of data.frame. R package version 1.14.0. https://CRAN.R-project.org/package=data.table
  • Mehrad Mahmoudian (2020). varhandle: Functions for Robust Variable Handling. R package version 2.0.5. https://CRAN.R-project.org/package=varhandle
  • Milan Bouchet-Valat (2020). SnowballC: Snowball Stemmers Based on the C ‘libstemmer’ UTF-8 Library. R package version 0.7.0. https://CRAN.R-project.org/package=SnowballC
  • Moritz S, Bartz-Beielstein T (2017). “imputeTS: Time Series Missing Value Imputation in R.” The R Journal, 9(1), 207-218. doi:10.32614/RJ-2017-009 (URL: https://doi.org/10.32614/RJ-2017-009).
  • R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
  • Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot
  • Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
  • Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77 http://www.biomedcentral.com/1471-2105/12/77/

References

Dani, Jennifer, Courtney Burrill, and Barbara Demmig-Adams. 2005. “The remarkable role of nutrition in learning and behaviour.” Nutrition and Food Science 35 (4): 258–63. https://doi.org/10.1108/00346650510605658.

Popkin, B. M., and P. Gordon-Larsen. 2004. “The nutrition transition: Worldwide obesity dynamics and their determinants.” International Journal of Obesity 28: S2–S9. https://doi.org/10.1038/sj.ijo.0802804.

Shanta Retelny, Victoria, Annie Neuendorf, and Julie L. Roth. 2008. “Nutrition protocols for the prevention of cardiovascular disease.” Nutrition in Clinical Practice 23 (5): 468–76. https://doi.org/10.1177/0884533608323425.